Introduction to Triton Programming: Moving from Threads to Program Instances

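In Triton, the basic unit of execution shifts from the scalar CUDA thread to the program instance. A program instance is an abstraction of a GPU thread block: a single instance processes all the elements of a vectorized "block" at once.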

1. Program Instance Identifiers

Every execution unit obtains its own identifier with pid = tl.program_id(axis=0). Compare a warehouse forklift (a program instance) lifting a pallet, a bundle of 128 boxes (a block), with a single worker (a CUDA thread) lifting one box at a time.
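The mapping from an identifier to a block of elements can be sketched in plain Python. This is only an emulation that runs without a GPU, not Triton code: BLOCK_SIZE, N_ELEMENTS, cdiv, and block_offsets are illustrative names chosen here, and the block size of 128 simply echoes the pallet in the analogy. In an actual kernel the offsets would typically be built from tl.program_id and tl.arange.

```python
# Plain-Python emulation of how a program instance maps its pid to a block.
# BLOCK_SIZE, N_ELEMENTS and block_offsets are illustrative names, not Triton API.

BLOCK_SIZE = 128     # elements handled by one program instance
N_ELEMENTS = 300     # total elements; deliberately not a multiple of 128


def cdiv(a, b):
    """Ceiling division: number of blocks needed to cover `a` elements."""
    return (a + b - 1) // b


def block_offsets(pid, block_size=BLOCK_SIZE, n_elements=N_ELEMENTS):
    """Return the in-bounds element indices owned by program instance `pid`."""
    start = pid * block_size
    offsets = [start + i for i in range(block_size)]
    # The mask drops indices past the end of the data (the ragged last block).
    return [off for off in offsets if off < n_elements]


grid = cdiv(N_ELEMENTS, BLOCK_SIZE)  # number of program instances to launch
for pid in range(grid):
    owned = block_offsets(pid)
    print(f"pid={pid}: elements {owned[0]}..{owned[-1]} ({len(owned)} total)")
```

With these assumed sizes, three instances cover the data, and the last one owns a partial block, which is why a bounds mask is part of the pattern.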

2. Triton Tensors vs. PyTorch Tensors

Understanding this semantic gap is important for memory management:

PyTorch tensor: A host-side Python object that wraps storage in VRAM, strides, and metadata.
Triton tensor: A compiler-level object that represents values or pointers held in registers or SRAM.

PyTorch view
A Python object that points to a linear (often contiguous) buffer in global memory.

Triton view
A 1D or 2D block of data that the compiler keeps in registers.
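To make the gap concrete, here is a rough plain-Python emulation of the two views. HostTensor and load_block are hypothetical stand-ins invented for this sketch; they are neither PyTorch nor Triton APIs. The host-side object carries storage, shape, and strides, while the block a kernel works on is just a handful of values with no metadata attached.

```python
# Plain-Python emulation of the two kinds of "tensor"; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class HostTensor:
    """Stand-in for a PyTorch tensor: metadata wrapped around flat storage."""
    storage: list      # stands in for the buffer in VRAM
    shape: tuple       # (rows, cols)
    strides: tuple     # storage steps per row / per column

    def address(self, row, col):
        """Flat storage index of element (row, col)."""
        return row * self.strides[0] + col * self.strides[1]

    def transpose(self):
        """A view: same storage, swapped shape and strides, no copy."""
        return HostTensor(self.storage, self.shape[::-1], self.strides[::-1])


def load_block(tensor, row, col_start, block_size):
    """Stand-in for a block load: gather values into a local 1D block.

    The result carries no strides or storage reference; it is just values,
    the way a Triton tensor is just a block of values inside the kernel.
    """
    cols = range(col_start, min(col_start + block_size, tensor.shape[1]))
    return [tensor.storage[tensor.address(row, c)] for c in cols]


t = HostTensor(storage=list(range(12)), shape=(3, 4), strides=(4, 1))
tt = t.transpose()                      # shape (4, 3), strides (1, 4)

print(load_block(t, row=1, col_start=0, block_size=4))   # row 1 of t
print(load_block(tt, row=1, col_start=0, block_size=4))  # row 1 of the view
```

The transposed view reuses the same storage and changes only the metadata, whereas each loaded block is an independent set of values. That is the distinction this section draws between the two kinds of tensor.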

3. The SPMD Nature of Triton

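Triton follows a single program, multiple data (SPMD) flow. Every program instance executes exactly the same code. Behavior diverges between instances only where the logic uses pid to compute specific memory offsets.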
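A small emulation may help show why the pid-based offset matters. The sketch below is plain Python, not Triton: launch, add_kernel, and add_kernel_ignoring_pid are illustrative names, the sequential loop stands in for a parallel grid launch, and the tiny BLOCK_SIZE is chosen only to keep the output readable. In real Triton the loads and stores would typically go through tl.load and tl.store with a mask.

```python
# Plain-Python emulation of an SPMD launch; names are illustrative, not Triton API.

BLOCK_SIZE = 4


def add_kernel(pid, x, y, out, n_elements):
    """The one kernel body that every program instance runs."""
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    for off in offsets:
        if off < n_elements:          # mask for the ragged last block
            out[off] = x[off] + y[off]


def add_kernel_ignoring_pid(pid, x, y, out, n_elements):
    """Buggy variant: offsets do not depend on pid."""
    offsets = list(range(BLOCK_SIZE))
    for off in offsets:
        if off < n_elements:
            out[off] = x[off] + y[off]


def launch(kernel, x, y):
    """Run `kernel` once per program instance, as a grid launch would."""
    n = len(x)
    out = [None] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE
    for pid in range(grid):           # on a GPU these run in parallel
        kernel(pid, x, y, out, n)
    return out


x = list(range(10))
y = [10 * v for v in x]
print(launch(add_kernel, x, y))
print(launch(add_kernel_ignoring_pid, x, y))
```

In the buggy variant every instance rewrites the same first block and the rest of the output is never touched. On real hardware those redundant writes would also happen concurrently.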

QUESTION 1

What is the primary identifier for a Triton execution unit?

threadIdx.x

tl.program_id(axis=0)

tl.block_idx()

torch.get_id()

QUESTION 2

True or False: A Triton tensor is a Python object that stores metadata like strides on the host CPU.

True

False

QUESTION 3

What is the result of 'forgetting that all program instances execute the same kernel body'?

The compiler will automatically distribute tasks.

Race conditions or overwriting memory if pid-based logic is missing.

The kernel will fail to compile due to a syntax error.

Execution time will double.

QUESTION 4

In the forklift analogy, what does the 'Aisle Number' represent?

The BLOCK_SIZE

The program_id (pid)

The GPU Driver version

The Pointer address

QUESTION 5

Why is the Triton model considered 'Vectorized' compared to CUDA?

It uses Python lists.

One Program Instance handles a block of elements, not just one scalar element.

It only works with 2D matrices.

It runs on the CPU's SIMD units.